Parametric Modelling of Multivariate Count Data Using Probabilistic Graphical Models
Abstract
Multivariate count data are defined as the numbers of items of different categories obtained by sampling within a population whose individuals are grouped into categories. The analysis of multivariate count data is a recurrent and crucial issue in numerous modelling problems, particularly in biology and ecology (where the data can represent, for example, children counts associated with multitype branching processes), sociology and econometrics. Denoting by K the number of categories, multivariate count data analysis relies on modelling the joint distribution of the K-dimensional random vector N = (N_0, ..., N_{K-1}) with discrete components. We focus on (I) identifying categories that appear simultaneously or, on the contrary, that are mutually exclusive, which is achieved by identifying conditional independence relationships between the K variables; (II) building parsimonious parametric models consistent with these relationships; (III) characterising and testing the effects of covariates on the distribution of N, particularly on the dependencies between its components. To achieve these goals, we propose an approach based on graphical probabilistic models (Koller & Friedman, 2009) to represent the conditional independence relationships in N, and on parametric distributions to ensure model parsimony. Three kinds of graphs are usually considered: undirected (UG), directed acyclic (DAG), or partially directed acyclic (PDAG). Models and methods for graph identification were proposed for UGs, using frequencies to estimate the probabilities (so-called nonparametric estimation) and mutual information; see Meyer et al. (2008) and references therein. Under a multivariate Gaussian assumption, an approach based on an L1 penalisation (lasso) was proposed by Friedman et al. (2008), with some extension to Poisson distributions with log-linear models for dependencies (Allen & Liu, 2012). Specific models and methods were developed for DAGs.
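As an illustration of the frequency-based (nonparametric) estimation mentioned above, the following minimal sketch computes a plug-in estimate of the mutual information between two count variables. It is not the procedure of Meyer et al. (2008); the function name `empirical_mutual_information` and the toy samples are assumptions made for this example only.

```python
from collections import Counter
from math import log


def empirical_mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats, with all probabilities
    replaced by observed frequencies (hypothetical helper)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    joint = Counter(zip(x, y))  # joint frequencies of (x_i, y_i)
    px = Counter(x)             # marginal frequencies of X
    py = Counter(y)             # marginal frequencies of Y
    mi = 0.0
    for (a, b), c in joint.items():
        # (c/n) * log( (c/n) / ((px[a]/n) * (py[b]/n)) )
        mi += (c / n) * log(c * n / (px[a] * py[b]))
    return mi


# Toy samples (made up for illustration): two count variables
# that always take the same value, and two that vary independently.
same = empirical_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
indep = empirical_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(same, indep)
```

A value near zero suggests no marginal dependence between the two categories, while larger values suggest an edge worth considering. With many categories and few observations, most joint cells are empty and such frequency-based estimates become unreliable.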
Most methods for graph identification in DAGs are based on exploring the set of possible graphs using some heuristic (e.g. hill climbing) and scoring the visited graphs (e.g. using BIC), the graph with the highest score being eventually selected; see Koller & Friedman (2009) for a review. The case of parametric models for PDAGs has been considered less often in the literature. A family of such models was proposed by Johnson & Hoeting (2011) using conditional Gaussian distributions, but the problem of graph identification was not addressed. Lee & Hastie (2012) addressed the problem of graph identification in graphical models with both continuous and discrete random variables, but in a restrictive setting of UGs with conditional Gaussian and multinomial distributions. Our context of application is characterised by zero-inflated, often right-skewed marginal distributions. Thus, Gaussian and Poisson distributions are not a priori appropriate. Moreover, the multivariate histograms typically have many cells, most of which are empty. Consequently, nonparametric estimation is not efficient.
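The two data features named above, zero inflation and sparse multivariate histograms, can be checked with a few lines of code. The sketch below is only a diagnostic under stated assumptions, not part of the proposed method: the helper names `zero_fractions` and `occupied_cell_fraction` and the toy data are hypothetical.

```python
from math import exp


def zero_fractions(counts):
    """Return (observed fraction of zeros, fraction of zeros expected
    under a Poisson distribution with the same mean)."""
    n = len(counts)
    if n == 0:
        raise ValueError("counts must be non-empty")
    mean = sum(counts) / n
    observed = sum(1 for c in counts if c == 0) / n
    return observed, exp(-mean)  # P(N = 0) = exp(-mean) for a Poisson


def occupied_cell_fraction(vectors, max_count):
    """Fraction of cells of the K-dimensional histogram on
    {0, ..., max_count}^K that contain at least one observation."""
    cells = {tuple(v) for v in vectors}
    if any(c < 0 or c > max_count for v in cells for c in v):
        raise ValueError("counts must lie in 0..max_count")
    k = len(next(iter(cells)))
    return len(cells) / (max_count + 1) ** k


# Toy data (made up for illustration).
marginal = [0, 0, 0, 0, 0, 0, 1, 2, 5, 12]
observed, expected = zero_fractions(marginal)
print(observed, expected)

sample = [(0, 0, 0), (0, 0, 0), (1, 0, 2), (0, 3, 0)]
print(occupied_cell_fraction(sample, max_count=3))
```

An observed zero fraction well above the Poisson value points to zero inflation, and a small occupied-cell fraction shows why estimating each cell probability from its frequency wastes most of the data.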
Similar resources
Mining and visualising ordinal data with non-parametric continuous BBNs
Data mining is the process of extracting and analysing information from large databases. Graphical models are a suitable framework for probabilistic modelling. A Bayesian Belief Net (BBN) is a probabilistic graphical model, which represents joint distributions in an intuitive and efficient way. It encodes the probability density (or mass) function of a set of variables by specifying a number of ...
An Introduction to Probabilistic Graphical Models
(2009): Probabilistic graphical models: principles and techniques. C. Bishop (2006): Pattern recognition and machine learning. (Graphical models chapter available online, as well as the figures — many are used in these slides after post-processing by Iain Murray and Frank Wood.) K. Murphy (2001): An introduction to graphical models. Modelling strategy is to assume the data was generated accordi...
Probabilistic Data Analysis with Probabilistic Programming
Probabilistic techniques are central to data analysis, but different approaches can be difficult to apply, combine, and compare. This paper introduces composable generative population models (CGPMs), a computational abstraction that extends directed graphical models and can be used to describe and compose a broad class of probabilistic data analysis techniques. Examples include hierarchical Bay...
Learning Discrete Partially Directed Acyclic Graphical Models in Multitype Branching Processes
We address the inference of discrete-state models for tree-structured data. Our aim is to introduce parametric multitype branching processes that can be efficiently estimated on the basis of data of limited size. Each generation distribution within this macroscopic model is modeled by a partially directed acyclic graphical model. The estimation of each graphical model relies on a greedy algorit...
Graphical models - methods for data analysis and mining
The best ebooks about Graphical Models Methods For Data Analysis And Mining that you can get for free here by download this Graphical Models Methods For Data Analysis And Mining and save to your desktop. This ebooks is under topic such as data mining with graphical models pdfsmanticscholar data mining with graphical models borgelt data mining with graphical models springer data mining with poss...
Journal title:
- CoRR
Volume abs/1312.4479
Publication date: 2013